Chapter 11 XML

This chapter shows you how to process the recently released BNC2014, which is by far the largest representative collection of spoken English collected in the UK. For the purpose of our in-class tutorials, I have included a small sample of the BNC2014 in our demo_data. However, the whole dataset is available via the official website: British National Corpus 2014. Please sign up for complete access to the corpus if you need it for your own research.

11.1 BNC Spoken 2014

XML is similar to HTML. Before you process the data, you need to understand the structure of the XML tags in the files. Other than that, the steps are much the same as what we have done before.

First, we read the XML using read_html():
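A minimal sketch of this step; a tiny inline document stands in for one of the corpus files so that the snippet is self-contained (in class you would pass a file path under demo_data/corp-bnc-spoken2014-sample/ instead):

```r
library(rvest)  # read_html() parses the file via xml2

# Inline stand-in for one BNC2014 transcript; replace xml_str with the path
# to a sample file, e.g. read_html("demo_data/corp-bnc-spoken2014-sample/<one of the xml files>")
xml_str <- '<u n="1" who="S0024"><w pos="AT1" lemma="a">an</w><w pos="NNT1" lemma="hour">hour</w></u>'
bnc_xml <- read_html(xml_str)
```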

Intuitively, our next step is to extract all utterances (tagged <u>...</u>) from the XML file. So you may want to do the following:
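A sketch of this first attempt, again with a tiny inline document standing in for a corpus file:

```r
library(rvest)

# Inline stand-in for a corpus file (two short utterances)
xml_str <- '<u n="1"><w pos="AT1">an</w><w pos="NNT1">hour</w><w pos="RRR">later</w></u>
            <u n="2"><w pos="UH">yeah</w></u>'
bnc_xml <- read_html(xml_str)

# Extract every <u> element and flatten it into text
utts <- bnc_xml %>% html_elements("u") %>% html_text()
utts  # "anhourlater" "yeah" -- the word boundaries are gone
```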

## [1] "\r\nanhourlaterhopeshestaysdownratherlate"                    
## [2] "\r\nwellshehadthosetwohoursearlier"                           
## [3] "\r\nyeahIknowbutthat'swhywe'reanhourlateisn'tit?mmI'mtirednow"
## [4] "\r\n"                                                         
## [5] "\r\ndidyoutext--ANONnameM"                                    
## [6] "\r\nyeahyeahhewrotebacknobotherlad"

See the problem?

Using the above method, you lose the word boundary information from the corpus.

What if you do the following?
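A sketch of this second attempt; the inline stand-in now includes a non-word `<pause>` token:

```r
library(rvest)

# Inline stand-in for a corpus file, with a <pause> between the words
xml_str <- '<u n="1"><w pos="AT1">an</w><w pos="NNT1">hour</w><pause dur="short"></pause><w pos="RRR">later</w></u>'
bnc_xml <- read_html(xml_str)

# Select the <w> elements inside each <u> instead
words <- bnc_xml %>% html_elements("u w") %>% html_text()
words  # "an" "hour" "later" -- boundaries kept, but the <pause> is silently
       # dropped and we no longer know which utterance each word came from
```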

##  [1] "an"      "hour"    "later"   "hope"    "she"     "stays"   "down"   
##  [8] "rather"  "late"    "well"    "she"     "had"     "those"   "two"    
## [15] "hours"   "earlier" "yeah"    "I"       "know"    "but"

At first sight, it may seem that we have solved the problem, but we have not. In fact, this method creates even more problems:

  • Our second method does not extract non-word tokens within each utterance (e.g., <pause .../>, <vocal .../>)
  • Our second method loses the utterance information (i.e., we don’t know which utterance each word belongs to)

So we cannot extract all <u> elements at once, nor can we extract all <w> elements at once. Instead, we need to process each <u> node one at a time.

First, let’s get all the <u> nodes.

## {html_node}
## <u n="1" who="S0024" trans="nonoverlap" whoconfidence="high">
##  [1] <w pos="AT1" lemma="a" class="ART" usas="Z5">an</w>
##  [2] <w pos="NNT1" lemma="hour" class="SUBST" usas="T1:3">hour</w>
##  [3] <w pos="RRR" lemma="later" class="ADV" usas="T4">later</w>
##  [4] <pause dur="short"></pause>
##  [5] <w pos="VV0" lemma="hope" class="VERB" usas="X2:6">hope</w>
##  [6] <w pos="PPHS1" lemma="she" class="PRON" usas="Z8">she</w>
##  [7] <w pos="VVZ" lemma="stay" class="VERB" usas="M8">stays</w>
##  [8] <w pos="RP" lemma="down" class="ADV" usas="Z5">down</w>
##  [9] <pause dur="short"></pause>
## [10] <w pos="RG" lemma="rather" class="ADV" usas="A13:5">rather</w>
## [11] <w pos="JJ" lemma="late" class="ADJ" usas="T4">late</w>

Take the first node in the XML document as an example: each utterance node includes words as well as non-word tokens (i.e., paralinguistic annotations such as <pause ...></pause>). We can retrieve:

  • words in an utterance
  • lemma forms of all words in the utterance
  • POS tags of all words in the utterance (BNC2014 uses the UCREL CLAWS6 tagset)
  • paralinguistic tags in the utterance
##  [1] "an"     "hour"   "later"  ""       "hope"   "she"    "stays"  "down"  
##  [9] ""       "rather" "late"
##  [1] "AT1"   "NNT1"  "RRR"   NA      "VV0"   "PPHS1" "VVZ"   "RP"    NA     
## [10] "RG"    "JJ"
##  [1] "a"      "hour"   "later"  NA       "hope"   "she"    "stay"   "down"  
##  [9] NA       "rather" "late"
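The three vectors above can be sketched as follows. html_children() keeps word and non-word tokens alike, which is why a <pause> shows up as an empty string in the text and as NA in the attribute vectors (an inline toy document again stands in for the corpus file):

```r
library(rvest)

xml_str <- '<u n="1"><w pos="AT1" lemma="a">an</w><w pos="NNT1" lemma="hour">hour</w><pause dur="short"></pause><w pos="RRR" lemma="later">later</w></u>'
u1 <- html_elements(read_html(xml_str), "u")[[1]]

tokens <- html_children(u1)   # word AND non-word tokens, in document order
html_text(tokens)             # "" for the <pause>
html_attr(tokens, "pos")      # NA for the <pause>
html_attr(tokens, "lemma")
```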

Exercise 11.1 Please come up with a way to extract both word and non-word tokens from each utterance. Ideally, the resulting data frame should consist of rows being the utterances, and columns including the attributes of each utterance.

Most importantly, the data frame should record not only the string of each utterance but also, for the word tokens, preserve the token-level part-of-speech annotation (see the utterance column in the table below). A sample utterance-based data frame is provided below.

11.2 Process the Whole Directory of BNC2014 Sample

11.2.1 Define Function

In Section 11.1, if you have figured out how to extract utterances as well as token-based information from the XML file, you can easily wrap the whole procedure into one function. With this function, we can apply the same procedure to all the XML files of the BNC2014.

For example, let’s assume that we have defined a function:

read_xml_BNC2014 <- function(xml){
  ...
}

This function takes one XML file as an argument and returns a data frame consisting of utterances and other relevant token-level information from that file.

Exercise 11.2 Now your job is to write this function, read_xml_BNC2014(xml = "").

11.2.2 Process All Files in the Directory

Now we utilize the self-defined function, read_xml_BNC2014(), to process all XML files in demo_data/corp-bnc-spoken2014-sample/. We then combine the individual data frames returned for each XML file into a bigger one, i.e., corp_bnc_df:
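The loop over the directory can be sketched as follows. To keep the sketch self-contained, a temporary directory and a dummy one-row reader stand in for demo_data/corp-bnc-spoken2014-sample/ and your own read_xml_BNC2014() from Exercise 11.2; swap in the real path and function for actual use:

```r
# Dummy corpus directory with two tiny files (stand-ins, not real BNC files)
tmp <- tempfile("bnc_"); dir.create(tmp)
writeLines('<u n="1"><w pos="UH">yeah</w></u>', file.path(tmp, "a.xml"))
writeLines('<u n="1"><w pos="UH">mm</w></u>',   file.path(tmp, "b.xml"))

# Stand-in reader: your Exercise 11.2 function goes here
read_xml_BNC2014 <- function(xml) {
  data.frame(file = basename(xml), stringsAsFactors = FALSE)
}

xml_files <- list.files(tmp, pattern = "\\.xml$", full.names = TRUE)
# Base R row-binding; purrr::map_dfr(xml_files, read_xml_BNC2014) does the same
corp_bnc_df <- do.call(rbind, lapply(xml_files, read_xml_BNC2014))
corp_bnc_df$file  # "a.xml" "b.xml" -- one data frame across all files
```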

## Time difference of 1.116706 mins

It takes a little over a minute to process the sample directory. You may want to store this corp_bnc_df data frame for later use so that you don't have to re-process the XML files every time you work with BNC2014.
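Caching can be done with saveRDS()/readRDS(); a toy data frame and a temporary path stand in for corp_bnc_df and a file in your own project:

```r
corp_bnc_df <- data.frame(u_id = 1:2, utterance = c("yeah", "mm"))  # toy stand-in
rds_path <- tempfile(fileext = ".rds")  # e.g. "corp_bnc_df.rds" in your project

saveRDS(corp_bnc_df, rds_path)      # store once ...
corp_bnc_df2 <- readRDS(rds_path)   # ... reload instantly in a later session
```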

11.3 Metadata

The best thing about BNC2014 is its rich metadata about the settings of the recorded conversations and the demographics of their speakers. The whole corpus comes with two metadata sets:

  • bnc2014spoken-textdata.tsv: metadata for each text transcript
  • bnc2014spoken-speakerdata.tsv: metadata for each speaker ID

These two metadata sets allow us to get more information about each transcript as well as the speakers in those transcripts.

11.3.1 Text Metadata

There are two files that are relevant to the text metadata:

  • bnc2014spoken-textdata.tsv: This file includes the header/metadata information of each text file
  • metadata-fields-text.txt: This file includes the column names/meanings of the previous text metadata tsv, i.e., bnc2014spoken-textdata.tsv.
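Loading the text metadata can be sketched like this. Since the column names live in the separate fields file, the tsv is read without a header row and the names are attached afterwards; the inline rows and the three column names here are made up for illustration, not the corpus's actual fields:

```r
# Toy stand-in for bnc2014spoken-textdata.tsv (no header row)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("S2A1\t4\t2015", "S2A2\t2\t2016"), tsv_path)

# read.delim() here; readr::read_tsv() works the same way
bnc_text_meta <- read.delim(tsv_path, header = FALSE, stringsAsFactors = FALSE)
# Attach the names documented in metadata-fields-text.txt (made-up names here)
names(bnc_text_meta) <- c("text_id", "n_speakers", "rec_year")
```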

11.3.2 Speaker Metadata

There are two files that are relevant to the speaker metadata:

  • bnc2014spoken-speakerdata.tsv: This file includes the demographic information of each speaker
  • metadata-fields-speaker.txt: This file includes the column names/meanings of the previous speaker metadata tsv, i.e., bnc2014spoken-speakerdata.tsv.

11.4 BNC2014 for Sociolinguistic Variation

Now with both the text-level and speaker-level metadata, bnc_text_meta and bnc_sp_meta, we can easily connect the utterances to speaker and text profiles using their unique IDs.

BNC2014 was born for the study of sociolinguistic variation. Here I would like to show you some naive examples, but you should be able to see the ideas and the potential of BNC2014.
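The linking step is an ordinary table join on the shared ID (the who attribute on <u> carries the speaker ID). A toy sketch with illustrative column names:

```r
# Toy utterance table and speaker metadata (illustrative column names)
utterances  <- data.frame(who = c("S0024", "S0058"),
                          utterance = c("an hour later", "yeah"))
bnc_sp_meta <- data.frame(spid = c("S0024", "S0058"),
                          gender = c("F", "M"))

# Base R join; dplyr::left_join(utterances, bnc_sp_meta,
#                               by = c("who" = "spid")) is equivalent
joined <- merge(utterances, bnc_sp_meta, by.x = "who", by.y = "spid")
```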

11.5 Word Frequency vs. Gender

Now we are ready to explore the gender differences in language.

11.5.1 Preprocessing

To begin with, there are some utterances with no words at all. We probably want to remove these utterances first.

11.5.2 Target Structures

Let’s assume that we would like to know which adjectives are most frequently used by men and women.
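Assuming the utterance column stores tokens as word_POS pairs (as in the pattern used in Section 11.6 below), the adjectives can be pulled out with a regular expression on the JJ tags. Base R here; stringr::str_extract_all() works the same way:

```r
# One POS-annotated utterance (toy example)
utt <- "she_PPHS1 stays_VVZ rather_RG late_JJ and_CC happy_JJ"

# JJ, JJR, JJT cover base, comparative, and superlative adjectives in CLAWS
adjs <- regmatches(utt, gregexpr("[^_ ]+_JJ[RT]?\\b", utt))[[1]]
adjs  # "late_JJ" "happy_JJ"
```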

11.5.3 Analysis

After we extract utterances with our target structures, we tokenize the utterances and create frequency lists of target structures, i.e., the adjectives.

  • Female wordcloud

  • Male wordcloud


Exercise 11.3 Which adjectives are more often used by male and female speakers? This should be a statistical problem. We can in fact extend our keyword analysis (cf. Chapter 6) to this question.

Please use the statistics of keyword analysis to find out the top 20 adjectives that are strongly attracted to female and male speakers according to G2 statistics. Please include in the analysis words whose frequencies >= 20 in the entire corpus.

Also, please note the problem of NaN values produced by log() when an observed frequency is zero.
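The G2 statistic and the NaN issue can be sketched in a few lines; this is the two-term log-likelihood commonly used for keyword analysis, with a guard for zero observed frequencies (in R, 0 * log(0) evaluates to NaN):

```r
# G2 for one word: a, b = its frequencies in corpus 1 and 2;
# c1, c2 = the total sizes of the two corpora
G2 <- function(a, b, c1, c2) {
  e1 <- c1 * (a + b) / (c1 + c2)   # expected frequency in corpus 1
  e2 <- c2 * (a + b) / (c1 + c2)   # expected frequency in corpus 2
  # 0 * log(0) is NaN in R, so treat a zero observed count as contributing 0
  term <- function(o, e) if (o == 0) 0 else o * log(o / e)
  2 * (term(a, e1) + term(b, e2))
}

G2(10, 10, 1000, 1000)  # equal relative frequencies -> 0
G2(20,  0, 1000, 1000)  # skewed toward corpus 1 -> large positive G2
```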

11.6 Degree ADV + ADJ

In this section I would like to show you an example where we can extend our lexical analysis to a particular syntactic pattern. Specifically, I would like to look at the adjectives that are emphasized in conversations (e.g., too bad, very good, quite cheap) and examine how these emphatic adjectives may differ between speakers of different genders.

Here we define our pattern, utilizing the POS tags and a regular expression: [^_]+_RG [^_]+_JJ
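A toy match of this pattern is sketched below. Note one subtlety: in `[^_]+` the character class also admits spaces, so on a fully tagged string the match can swallow the preceding token's tag; excluding the space as well (`[^_ ]+`) keeps the match to the ADV + ADJ pair. Base R here; stringr::str_extract_all() is the tidyverse equivalent:

```r
utt <- "it_PPH1 was_VBDZ really_RG good_JJ and_CC quite_RG cheap_JJ"

# Degree adverb (RG) immediately followed by an adjective (JJ)
hits <- regmatches(utt, gregexpr("[^_ ]+_RG [^_ ]+_JJ", utt))[[1]]
hits  # "really_RG good_JJ" "quite_RG cheap_JJ"
```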


Exercise 11.4 In the previous task, we obtained the frequency list of adjectives by gender, i.e., pattern_by_gender. Please create a wide version of the frequency list, where each row is a word type and the columns include the frequencies of the word for male and female speakers, as well as the dispersion of the word among male and female speakers. A sample has been provided below. Dispersion is defined as the number of speakers who use the adjective at least once.

Exercise 11.5 Following Exercise 11.4, it should now be clear to you that determining which adjectives are more likely to be emphasized by male and female speakers is a statistical question.

Please use the statistics G2 from keyword analysis to find out the top 10 emphasized adjectives that are strongly attracted to female and male speakers according to G2 statistics. Please include in the analysis adjectives whose dispersion >= 2 in the respective corpus, i.e., adjectives that have been used by at least TWO different male or female speakers.

Also, please note the problem of NaN values produced by log() when an observed frequency is zero.

Exercise 11.6 Please analyze the verbs that co-occur with the first-person pronoun I in BNC2014 in terms of speakers of different genders. Please create a frequency list of verbs that follow the first person pronoun I in demo_data/corp-bnc-spoken2014-sample. Verbs are defined as any words whose POS tag starts with VV.

Also, please create the word clouds of the top 100 verbs for male and female speakers.

Exercise 11.7 Please analyze the recurrent four-grams used by male and female speakers by showing the top 20 four-grams used by males and females respectively, ranked according to their dispersion. The dispersion of a four-gram is defined as the number of texts in which the four-gram is observed.